Imports and Initializations

Note: This document cannot be converted to pdf due to the presence of plotly graphs which require a paid subscription for the pdf service. Instead, please view the .html version of the document. The graphs are interactive - try hovering your mouse over points on it!

Function Declarations

Question 2.1

In this part, you will be asked to build a model to forecast the hourly readings in the future (next hour).

  1. Can you explain why you may want to forecast the gas consumption in the future? Who would find this information valuable?
  2. What can you do if you have a good forecasting model?

As one of the fundamental driving forces of the world's economic activity, energy is a crucial consideration in many key decision-making processes. Because fossil fuels are non-renewable and demand for them is rising rapidly, it is important to use them as efficiently as possible. Although natural gas is itself a fossil fuel, its combustion emits less greenhouse gas than that of coal or oil, which makes it a cleaner and safer option among fossil fuels.

A good forecasting model allows power and gas utility companies to predict the periods in which a given area will see higher demand. Accurate optimization of underground stock in turn helps companies avoid overstocking, which is costly because unused gas must still be paid for under contractual agreements. Avoiding understocking is equally important, to prevent downtime and the other serious repercussions of failing to meet demand.

Question 2.2

Build a linear regression model to forecast the hourly readings in the future (next hour).

Generate two plots:

1. Time series plot of the actual and predicted hourly meter readings

2. Scatter plot of actual vs predicted meter readings (along with the line showing how good the fit is)

In creating the predictions, we hypothesized that, given the last 6 hours of usage, the next hour can be predicted regardless of which house the readings come from. This removes the problem of each house having a different gradient and a different starting point, especially since dataid is an arbitrary identifier. After trying lookback periods of 24, 12, and 6 hours, we found 6 hours to be the best choice and stored that value in the LOOKBACK_PERIOD variable.
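The windowing idea can be illustrated with a minimal sketch. The helper name make_windows and the synthetic readings are assumptions made for this example; only LOOKBACK_PERIOD and its value of 6 come from the text, and the notebook's own preprocessing may differ.

```python
import numpy as np

LOOKBACK_PERIOD = 6  # hours of history per sample, as chosen in the text


def make_windows(readings, lookback=LOOKBACK_PERIOD):
    """Turn a 1-D series of hourly readings into (X, y) pairs.

    Each row of X holds `lookback` consecutive readings, and the matching
    entry of y is the reading for the hour that follows them.
    (Hypothetical helper, not the notebook's own function.)
    """
    readings = np.asarray(readings, dtype=float)
    if len(readings) <= lookback:
        raise ValueError("need more readings than the lookback period")
    X = np.stack([readings[i:i + lookback]
                  for i in range(len(readings) - lookback)])
    y = readings[lookback:]
    return X, y


# Synthetic cumulative meter readings, for illustration only.
hourly = np.arange(10, dtype=float) * 2.0
X, y = make_windows(hourly)
print(X.shape, y.shape)  # (4, 6) (4,)
```

Because every sample is just six consecutive readings and the one that follows, windows from different houses can be stacked into a single training set without using dataid as a feature.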

The MSE on the test set is lower than on the train set, which suggests that the model generalizes well to data it was not fitted on and is unlikely to be overfitting the train set.
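A train/test MSE comparison of this kind might look like the sketch below. The data are synthetic, and the chronological 80/20 split is an assumption, since the text does not say how the sets were divided; the fit is plain ordinary least squares via NumPy rather than the notebook's own model object.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficient vector."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Apply the fitted coefficients (last entry is the intercept)."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def mse(y_true, y_pred):
    """Mean squared error between two arrays."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Synthetic windowed data (assumption): 200 samples, 6 lagged features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = X @ np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3]) + rng.normal(scale=0.05, size=200)

# Chronological split (assumption): the test rows come after the train rows.
split = int(0.8 * len(X))
coef = fit_linear(X[:split], y[:split])
train_mse = mse(y[:split], predict_linear(coef, X[:split]))
test_mse = mse(y[split:], predict_linear(coef, X[split:]))
print(f"train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

A test MSE at or below the train MSE is consistent with no overfitting, although with a small test set the gap can also be due to chance.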

Architecture for inference:

[Raw-data] -> [PreProcessing] -> [ModelFrontend] -> [Model] -> [Predictions]

We present two modes of prediction with our model, namely simulate_operation and long_term_prediction. Instead of making predictions directly over all available data, prediction is done over sampling windows whose size is defined by LOOKBACK_PERIOD.
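The difference between the two modes can be sketched as follows. The function names are taken from the text, but their bodies and signatures here are assumptions: one-step mode always builds its window from actual readings, while long-term mode feeds each prediction back into the window. The predict_next function is a stand-in for a fitted model.

```python
import numpy as np

LOOKBACK_PERIOD = 6


def simulate_operation(predict_next, readings, lookback=LOOKBACK_PERIOD):
    """One-step-ahead mode: every window is built from actual readings."""
    readings = np.asarray(readings, dtype=float)
    return np.array([
        predict_next(readings[i:i + lookback])
        for i in range(len(readings) - lookback)
    ])


def long_term_prediction(predict_next, seed, steps, lookback=LOOKBACK_PERIOD):
    """Recursive mode: each prediction is appended to the window and reused."""
    window = list(np.asarray(seed, dtype=float)[-lookback:])
    out = []
    for _ in range(steps):
        nxt = float(predict_next(np.array(window)))
        out.append(nxt)
        window = window[1:] + [nxt]  # slide the window over the prediction
    return np.array(out)


# Stand-in for a fitted model (assumption): extrapolate the last hourly increase.
def predict_next(window):
    return window[-1] + (window[-1] - window[-2])


actual = np.array([0, 1, 2, 3, 4, 5, 6, 8, 10, 12], dtype=float)
one_step = simulate_operation(predict_next, actual)
recursive = long_term_prediction(predict_next, actual[:LOOKBACK_PERIOD], steps=4)
print(one_step.tolist())   # [6.0, 7.0, 10.0, 12.0]
print(recursive.tolist())  # [6.0, 7.0, 8.0, 9.0]
```

In this toy series the usage rate doubles after hour 6. The one-step mode picks up the change as soon as it appears in the actual data, while the recursive mode never sees it and drifts away from the actual values.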

Question 2.2 (a) - Time series plot of the actual and predicted hourly meter readings with linear regression

In this initial scenario we carry out predictions using data from a single house, dataid 35. With hourly simulation, the model can be observed to predict the meter value in a pessimistic manner. This is likely because the model learns the incremental increase within the hourly window; increasing LOOKBACK_PERIOD would therefore be expected to make the model more optimistic.

One noticeable issue is that the input data fed into the model was imputed, so imperfections in the imputation can affect the model significantly.

In this scenario we use the mean of the hourly increase in meter value (val_diff) across all houses in the dataframe, as computed by the function mean_readings_for_area. The model generally does well on this "mean" data for the entire area and predicts better than it did on the actual cumulative meter value of a single house. We theorized that fitting linear regression on mean data would help the model generalize, and the result obtained above supports this. Fitting linear regression on the readings of a single house leads to results that are akin to overfitting. A likely reason is that the mean hourly val_diff dataset has lower deviation than the cumulative meter values, so the prediction error of linear regression is less drastic and the predicted change at the next data point is smaller. This model can be used to better predict the average gas usage of the entire area over the next hour.

However, it must be kept in mind that long-term predictions are still very poor and unreliable. Our group tried predicting multiple hours ahead in an attempt to improve them, but did not see any considerable impact on the results.
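One way such an area-wide mean could be computed is sketched below. The function name mean_readings_for_area and the columns dataid and val_diff come from the text; the body, the hour and meter_value column names, and the sample data are assumptions for illustration.

```python
import pandas as pd


def mean_readings_for_area(df):
    """Average the hourly increase (val_diff) across all houses.

    Assumes one row per house per hour with columns `dataid`, `hour`
    and a cumulative `meter_value`; `hour` and `meter_value` are
    placeholder names, not taken from the notebook.
    """
    df = df.sort_values(["dataid", "hour"]).copy()
    # Hourly increase per house; the first hour of each house has no predecessor.
    df["val_diff"] = df.groupby("dataid")["meter_value"].diff()
    return df.dropna(subset=["val_diff"]).groupby("hour")["val_diff"].mean()


# Two synthetic houses with very different starting points.
df = pd.DataFrame({
    "dataid": [35, 35, 35, 44, 44, 44],
    "hour": [0, 1, 2, 0, 1, 2],
    "meter_value": [100.0, 102.0, 106.0, 5000.0, 5004.0, 5006.0],
})
area_mean = mean_readings_for_area(df)
print(area_mean.tolist())  # [3.0, 3.0]
```

Differencing removes each house's starting offset, and averaging smooths out house-specific swings: the two houses above alternate between increases of 2 and 4, yet the area mean is a steady 3 per hour.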

Question 2.2 (b) - Scatter plot of actual vs predicted meter readings (along with the line showing how good the fit is) with linear regression

In the graph above, the actual data of a single house (dataid 35) is shown as a scatter plot together with the hourly predictions from the linear regression in Q2.2(a). At first glance, the agreement between the actual points and the predictions is relatively satisfactory. The trendline is obtained by fitting the single-house linear regression predictions with the ordinary least squares method. The linear regression model fares generally well against the scatter of actual data. Unfortunately, the predictions introduce spikes, possibly caused by successive changes in the points inside the regression window.

We now extract a trendline by fitting the linear regression predictions for the mean of all houses with the ordinary least squares method. The actual mean data points are plotted as a scatter along with the hourly linear regression predictions for the mean. By the same argument as before, using the mean of val_diff lets us extract general trends that affect all houses rather than a single house. This suggests that the mean change in meter value across houses tends to follow a linear relationship. In addition, fewer spikes are introduced in the long-term predictions with linear regression.
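A trendline of this kind can be obtained with a one-line least-squares fit. The sketch below uses np.polyfit on synthetic values; the helper name fit_trendline and the data are assumptions, and the plotting library used in the notebook may compute its trendline differently.

```python
import numpy as np


def fit_trendline(actual, predicted):
    """Least-squares line predicted = slope * actual + intercept, plus R^2."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    slope, intercept = np.polyfit(actual, predicted, deg=1)
    fitted = slope * actual + intercept
    ss_res = np.sum((predicted - fitted) ** 2)            # unexplained variation
    ss_tot = np.sum((predicted - predicted.mean()) ** 2)  # total variation
    return float(slope), float(intercept), float(1.0 - ss_res / ss_tot)


# Synthetic example: predictions that sit a constant 1.0 above the actual values.
actual = np.array([10.0, 20.0, 30.0, 40.0])
predicted = actual + 1.0
slope, intercept, r2 = fit_trendline(actual, predicted)
print(round(slope, 6), round(intercept, 6), round(r2, 6))
```

A slope near 1 and an intercept near 0 indicate predictions that track the actual readings; a non-zero intercept, as in this toy example, indicates a constant bias.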

Question 2.3

Do the same as Question 2.2 above but use support vector regression (SVR).

Generate two plots:

1. Time series plot of the actual and predicted hourly meter readings

2. Scatter plot of actual vs predicted meter readings (along with the line showing how good the fit is)

Question 2.3 (a) - Time series plot of the actual and predicted hourly meter readings with SVR

An SVR model with an RBF kernel was also explored but later abandoned, as its training time was considerably long and unrealistic given the project timeline. This is due to the added cost of evaluating the Radial Basis Function in RBF SVR. In theory, SVR with an RBF kernel implicitly projects the data into a higher-dimensional space, which lets it capture non-linear relationships in the input data that SVR with a linear kernel cannot. In our project, however, it could be argued that RBF is unnecessary: the meter value likely follows a linear relationship, as its predictability was already demonstrated with linear regression. For the same reason, we did not pursue a polynomial kernel, since our data points can be regressed linearly.
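The kernel choice can be compared on a small scale with scikit-learn's SVR class. The sketch below is only illustrative: the data are synthetic and linear by construction, and the hyperparameters C and epsilon are assumptions, not the values used in the notebook.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic windowed data (assumption): target is a linear function of 6 lags.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(300, 6))
y = X @ np.array([0.1, 0.1, 0.1, 0.2, 0.2, 0.3]) + rng.normal(scale=0.01, size=300)
X_train, X_test = X[:240], X[240:]
y_train, y_test = y[:240], y[240:]

results = {}
for kernel in ("linear", "rbf"):
    # C and epsilon are illustrative values, not tuned.
    model = SVR(kernel=kernel, C=10.0, epsilon=0.01)
    model.fit(X_train, y_train)
    results[kernel] = float(np.mean((model.predict(X_test) - y_test) ** 2))
    print(f"{kernel:>6} kernel  test MSE={results[kernel]:.6f}")
```

When the underlying relationship is linear, the RBF kernel has little room to improve on the linear kernel, which is the argument made above for not pursuing it further.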

(Please refer to the appendix for results with RBF)

The graph above shows time series plots of the actual data, the data after imputation, the predicted data, and the long-term prediction. As with linear regression, prediction is first done on a single house, dataid 35. In theory, SVR is expected to give better predictions on this data because of its iterative search for the best fit. Compared with linear regression, SVR with a linear kernel shows less overfitting and better generalization to the overall data.

When the long-term predictor is used to predict on top of its own predictions, the linear SVR is more accurate in predicting the increase in meter value, although a rapid rise in the long-term prediction around 2 October causes the overall error to be carried into subsequent predictions. The long-term prediction method also allows polynomial-like fitting. Unfortunately, this means that errors can pull the long-term prediction curve down if the predictions decrease over several window steps, leading to results that "bend downwards" in some cases. The meter value is then under-predicted, which is not ideal for gas companies. Our exploration with RBF SVR showed that its predictions exhibit this behaviour more consistently.

In this scenario mean readings are again used, with the mean_readings_for_area function doing the preprocessing. The model generally does well on "mean" data for the entire area and predicts better than it did on the actual cumulative meter value of a single house. The mean predictions are better than those of linear regression and generalize better than predictions based on a single house. The reason is the same as explained in 2.2(a): the mean hourly val_diff dataset has lower deviation than the cumulative meter values, so the prediction error with SVR is less drastic and the predicted changes are smaller. The long-term prediction is also more accurate with SVR than with linear regression. From early October until 3 October, the long-term prediction stays within a reasonable margin of error of the mean prediction and the actual data, before diverging as the error accumulates and influences subsequent predictions. This model can be used to better predict the average gas usage of the entire area over the next hour.

Question 2.3 (b) - Scatter plot of actual vs predicted meter readings (along with the line showing how good the fit is) with SVR

In the graph above, the actual data of a single house (dataid 35) is shown as a scatter plot together with the hourly predictions from the linear-kernel SVR in Q2.3(a). The results are similar to those of the linear regression model.

In this analysis the long-term prediction trendlines are also added, and they are shown to be better than the long-term prediction trendline from linear regression.

Appendix: Results with other kernels (RBF SVR) and

Warning: takes a really long time to train!